-
Notifications
You must be signed in to change notification settings - Fork 962
Require dtype argument to cudf_polars Column
container
#19193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Require dtype argument to cudf_polars Column
container
#19193
Conversation
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
…eschke/cudf into feat/cudf_polars/struct_expr
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
if dtype_str.startswith("list["): | ||
stripped = dtype_str.removeprefix("list[").removesuffix("]") | ||
return pl.List(_dtype_short_repr_to_dtype(stripped)) | ||
return pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about the return type here: I see that https://github.com/pola-rs/polars/blob/405c371c41d71e4463829062ba297e3378cfd85d/py-polars/polars/datatypes/convert.py returns a PolarsDataType | None
.
- How should we handle the
None
case (which seems to happen for invalid data types)? PolarsType
is defined asUnion["DataTypeClass", "DataType"]
. Do we need to worry about theDataTypeClass
variant ? I'm not exactly sure what this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should we handle the None case (which seems to happen for invalid data types)?
If we allow Column
s to have an optional, None
, dtype I suppose a None
return here could still be valid, but that would hide the root cause of an invalid data type
I suppose we can raise/add an assert
here, but would it be better to do that during deserialization or before serialization?
Do we need to worry about the DataTypeClass variant ?
Good catch. Yes, it appears this method can return a polars.DataType
class (DataTypeClass
) and not necessarily an instance which is what our DataType
container expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we can raise/add an assert here, but would it be better to do that during deserialization or before serialization?
I think raising an error here (in the deserialization) is appropriate. If we're unable to parse a dtype then presumably something has gone very wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice overall. One question about some of the dtype deserialization logic.
I'll follow up with a PR ensuring that we have sufficient test coverage for the serialization.
pl_type = pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str) | ||
if pl_type is None: | ||
raise ValueError(f"{dtype_str} was not able to be parsed by Polars.") | ||
return pl_type() if inspect.isclass(pl_type) else pl_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How safe is pl_type()
, without any arguments, here? Some types (Array
, Enum
) require additional arguments. Maybe we don't support those yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the types we support in DataType
, I believe this is fairly safe as I'm hoping that dtype_short_repr_to_dtype
will return instances for types with parameters (polars.Datetime
and polars.Duration
).
For those types that we don't support that take arguments, those should be rejected when constructing a DataType
/merge |
With #19193 and this PR, we'll not import `pyarrow` explicitly in `cudf_polars` xref #18534 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #19219
In https://github.com/rapidsai/cudf/pull/18953/files#r2167631601, we found an issue with (de)serialization of cudf_polars data types with a multi-GPU setup. Polars lets you use either datatype classes or instances in most places. These are similar in most ways (hash the same, compare equal) but our deserialization implementation relies on isinstance checks which are not the same. This could cause issues if we tried to deserialize a type like `'i8'` that wasn't already in our cache. This simulates that situation on a single worker by clearing the cache before deserialzation. This was fixed in rapidsai#19193, but I've added a regression test here.
In https://github.com/rapidsai/cudf/pull/18953/files#r2167631601, we found an issue with (de)serialization of cudf_polars data types with a multi-GPU setup. Polars lets you use either datatype classes or instances in most places. These are similar in most ways (hash the same, compare equal) but our deserialization implementation relies on isinstance checks which are not the same. This could cause issues if we tried to deserialize a type like `'i8'` that wasn't already in our cache. This simulates that situation on a single worker by clearing the cache before deserialzation. This was fixed in #19193, but I've added a regression test here. Authors: - Tom Augspurger (https://github.com/TomAugspurger) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) URL: #19224
Needs #19193 This PR does not necessarily support Polars expressions that manipulate timezone information but ensure we can consume and return Polars Datetime types with a `time_zone`. We'll need to preserve the timezone name in the `column_metadata` struct in order to pass it to the Arrow schema generated by nanoarrow. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #19155
Description
Depends on #19075
Following #19091, this PR ensure the
Column
always contains aDataType
object such that Polars type metadata such as struct field names are preservedChecklist